Cluster Based Symbolic Representation for Skewed Text Categorization
Authors
Abstract
This work addresses a problem associated with imbalanced text corpora and presents a method for converting an imbalanced text corpus into a balanced one by means of a clustering algorithm. First, to avoid the curse of dimensionality, an effective representation scheme based on a term-class relevancy measure is adopted, which reduces the dimension to the number of classes in the corpus. Next, the samples of the larger classes are grouped into a number of smaller subclasses so that the entire corpus becomes balanced. Each subclass is then given a single symbolic vector representation using interval-valued features. This symbolic representation is compact and reduces both the space requirement and the classification time. The proposed model is evaluated empirically on the benchmark datasets Reuters-21578 and TDT2 and compared against several contemporary models, including one based on a support vector machine. The comparative analysis indicates that the proposed model outperforms the other models.
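The balancing and symbolic-representation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them. The documents are assumed to be already reduced to low-dimensional vectors (the term-class relevancy measure itself is not reproduced here). Plain k-means stands in for the unnamed clustering algorithm. Each interval is taken as the per-feature minimum and maximum of a subclass. A test vector is assigned to the subclass whose intervals contain the largest number of its features. The names `kmeans`, `build_symbolic_model`, `classify`, and `target_size` are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in); returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every sample to every current center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def build_symbolic_model(X, y, target_size):
    """Split classes larger than target_size into subclasses and keep one
    interval vector (lower bounds, upper bounds) per subclass."""
    model = []  # entries: (class_label, lower_bounds, upper_bounds)
    for c in np.unique(y):
        Xc = X[y == c]
        k = max(1, int(np.ceil(len(Xc) / target_size)))
        labels = kmeans(Xc, k) if k > 1 else np.zeros(len(Xc), dtype=int)
        for j in range(k):
            members = Xc[labels == j]
            if len(members):
                # Assumed interval construction: per-feature min and max.
                model.append((c, members.min(axis=0), members.max(axis=0)))
    return model

def classify(x, model):
    """Return the class of the subclass whose intervals contain the most
    features of x (assumed matching rule; ties go to the first entry)."""
    def score(entry):
        _, lo, hi = entry
        return int(np.sum((x >= lo) & (x <= hi)))
    return max(model, key=score)[0]

# Toy data: class 0 has six samples in two groups, class 1 has three.
X_demo = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
    [5.0, 5.1], [5.1, 5.0], [5.2, 5.2],
    [10.0, 10.0], [10.1, 10.2], [10.2, 10.1],
])
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
model = build_symbolic_model(X_demo, y_demo, target_size=3)
print(len(model), classify(np.array([5.1, 5.1]), model))
```

In this toy setting the larger class is split into two subclasses of three samples each, so every stored interval vector summarizes the same number of samples. Only one pair of bound vectors per subclass is kept, which is where the space and classification-time savings mentioned in the abstract would come from.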
Similar resources
Distributional Clustering of Words for Text Categorization Research Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science
We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. The word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high per...
Distributional Word Clusters vs. Words for Text Categorization
We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high pe...
Text Categorization Experiments Using Wikipedia
Over the years many models had been proposed for text categorization. One of the most widely applied is the vector space model, assuming independence between indexing terms. Since training corpora sizes are relatively small – compared to ∞ – the generalization power of the learning algorithms is relatively low. Using a bigger unannotated text corpus can boost the representation and hence the le...
Kernel PCA based clustering for inducing features in text categorization
We study dimensionality reduction or feature selection in text document categorization problem. We focus on the first step in building text categorization systems, that is the choice of efficiently representing numerically the natural language text. This numerical representation is going to be used by machine learning algorithms. We propose a representation based on word clusters. We build a ke...
Mining and its Application in Biomedical Domain
Semantic Text Mining and its Application in Biomedical Domain Illhoi Yoo Xiaohua Hu, Ph.D A huge amount of biomedical knowledge and novel discoveries have been produced and collected in text databases or digital libraries, such as MEDLINE, because the most natural form to store information is text. In order to cope with this pressing text information overload, text mining is employed. However, ...
Journal title:
Volume, issue:
Pages: -
Publication date: 2016